Whats Covered
- Mutating joins
- Filtering joins and set operations
- Assembling data
- Advanced joining
- Case Study
Aditional Resources
Case study
Lahman’s Baseball Database
- library(Lahman)
- Sean Lahman’s baseball database
- 26 tables
- A great example of real worl relational data
This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2016. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875.
This database was created by Sean Lahman, who pioneered the effort
to make baseball statistics freely available to the general
public. What started as a one man effort in 1994 has grown
tremendously, and now a team of researchers have collected their
efforts to make this the largest and most accurate source for
baseball statistics available anywhere.
This database, in the form of an R package offers a variety of
interesting challenges and opportunities for data processing and
visualization in R.
In the current version, the examples make extensive use of the
'dplyr' package for data manipulation (tabulation, queries,
summaries, merging, etc.), reflecting the original relational
database design and 'ggplot2' for graphics.
**Details:**
Package: Lahman
Type: Package
Version: 6.0-0
Date: 2017-07-04
License: GPL version 2 or newer
LazyLoad: yes
LazyData: yes
The main form of this database is a relational database in
Microsoft Access format. The design follows these general
principles: Each player is assigned a unique code ('playerID').
All of the information in different tables relating to that player
is tagged with his 'playerID'. The 'playerID's are linked to
names and birthdates in the 'Master' table. Similar links exist
among other tables via analogous '*ID' variables.
The database is composed of the following main tables:
'Master' Player names, dates of birth, death and other
biographical info
'Batting' batting statistics
'Pitching' pitching statistics
'Fielding' fielding statistics
A collection of other tables is also provided:
Teams:
'Teams' yearly stats and standings
'TeamsHalf' split season data for teams
'TeamsFranchises' franchise information
Post-season play:
'BattingPost' post-season batting statistics
'PitchingPost' post-season pitching statistics
'FieldingPost' post-season fielding data
'SeriesPost' post-season series information
Awards:
'AwardsManagers' awards won by managers
'AwardsPlayers' awards won by players
'AwardsShareManagers' award voting for manager awards
'AwardsSharePlayers' award voting for player awards
Hall of Fame: links to Master via 'hofID'
'HallOfFame' Hall of Fame voting data
Other tables:
'AllstarFull' - All-Star games appearances; 'Managers' -
managerial statistics; 'FieldingOF' - outfield position data;
'ManagersHalf' - split season data for managers; 'Salaries' -
player salary data; 'Appearances' - data on player appearances;
'Schools' - Information on schools players attended;
'CollegePlaying' - Information on schools players attended, by
player and year;
Variable label tables are provided for some of the tables:
'battingLabels', 'pitchingLabels', 'fieldingLabels'
Universal keys?
- Before starting to analyze the data, you’ll examine how the datasets are related to each other.
- In particular, do any variable names span all of the datasets?
library(purrr)
# Examine lahmanNames
lahmanNames## $AllstarFull
## # A tibble: 8 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 gameNum
## 4 gameID
## 5 teamID
## 6 lgID
## 7 GP
## 8 startingPos
##
## $Appearances
## # A tibble: 21 x 1
## var
## <chr>
## 1 yearID
## 2 teamID
## 3 lgID
## 4 playerID
## 5 G_all
## 6 GS
## 7 G_batting
## 8 G_defense
## 9 G_p
## 10 G_c
## # ... with 11 more rows
##
## $AwardsManagers
## # A tibble: 6 x 1
## var
## <chr>
## 1 playerID
## 2 awardID
## 3 yearID
## 4 lgID
## 5 tie
## 6 notes
##
## $AwardsPlayers
## # A tibble: 6 x 1
## var
## <chr>
## 1 playerID
## 2 awardID
## 3 yearID
## 4 lgID
## 5 tie
## 6 notes
##
## $AwardsShareManagers
## # A tibble: 7 x 1
## var
## <chr>
## 1 awardID
## 2 yearID
## 3 lgID
## 4 playerID
## 5 pointsWon
## 6 pointsMax
## 7 votesFirst
##
## $AwardsSharePlayers
## # A tibble: 7 x 1
## var
## <chr>
## 1 awardID
## 2 yearID
## 3 lgID
## 4 playerID
## 5 pointsWon
## 6 pointsMax
## 7 votesFirst
##
## $Batting
## # A tibble: 22 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 stint
## 4 teamID
## 5 lgID
## 6 G
## 7 AB
## 8 R
## 9 H
## 10 X2B
## # ... with 12 more rows
##
## $battingLabels
## # A tibble: 2 x 1
## var
## <chr>
## 1 variable
## 2 label
##
## $BattingPost
## # A tibble: 22 x 1
## var
## <chr>
## 1 yearID
## 2 round
## 3 playerID
## 4 teamID
## 5 lgID
## 6 G
## 7 AB
## 8 R
## 9 H
## 10 X2B
## # ... with 12 more rows
##
## $CollegePlaying
## # A tibble: 3 x 1
## var
## <chr>
## 1 playerID
## 2 schoolID
## 3 yearID
##
## $Fielding
## # A tibble: 18 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 stint
## 4 teamID
## 5 lgID
## 6 POS
## 7 G
## 8 GS
## 9 InnOuts
## 10 PO
## 11 A
## 12 E
## 13 DP
## 14 PB
## 15 WP
## 16 SB
## 17 CS
## 18 ZR
##
## $fieldingLabels
## # A tibble: 2 x 1
## var
## <chr>
## 1 variable
## 2 label
##
## $FieldingOF
## # A tibble: 6 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 stint
## 4 Glf
## 5 Gcf
## 6 Grf
##
## $FieldingPost
## # A tibble: 17 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 teamID
## 4 lgID
## 5 round
## 6 POS
## 7 G
## 8 GS
## 9 InnOuts
## 10 PO
## 11 A
## 12 E
## 13 DP
## 14 TP
## 15 PB
## 16 SB
## 17 CS
##
## $HallOfFame
## # A tibble: 9 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 votedBy
## 4 ballots
## 5 needed
## 6 votes
## 7 inducted
## 8 category
## 9 needed_note
##
## $LahmanData
## # A tibble: 5 x 1
## var
## <chr>
## 1 file
## 2 class
## 3 nobs
## 4 nvar
## 5 title
##
## $Managers
## # A tibble: 10 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 teamID
## 4 lgID
## 5 inseason
## 6 G
## 7 W
## 8 L
## 9 rank
## 10 plyrMgr
##
## $ManagersHalf
## # A tibble: 10 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 teamID
## 4 lgID
## 5 inseason
## 6 half
## 7 G
## 8 W
## 9 L
## 10 rank
##
## $Master
## # A tibble: 26 x 1
## var
## <chr>
## 1 playerID
## 2 birthYear
## 3 birthMonth
## 4 birthDay
## 5 birthCountry
## 6 birthState
## 7 birthCity
## 8 deathYear
## 9 deathMonth
## 10 deathDay
## # ... with 16 more rows
##
## $Pitching
## # A tibble: 30 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 stint
## 4 teamID
## 5 lgID
## 6 W
## 7 L
## 8 G
## 9 GS
## 10 CG
## # ... with 20 more rows
##
## $pitchingLabels
## # A tibble: 2 x 1
## var
## <chr>
## 1 variable
## 2 label
##
## $PitchingPost
## # A tibble: 30 x 1
## var
## <chr>
## 1 playerID
## 2 yearID
## 3 round
## 4 teamID
## 5 lgID
## 6 W
## 7 L
## 8 G
## 9 GS
## 10 CG
## # ... with 20 more rows
##
## $Salaries
## # A tibble: 5 x 1
## var
## <chr>
## 1 yearID
## 2 teamID
## 3 lgID
## 4 playerID
## 5 salary
##
## $Schools
## # A tibble: 5 x 1
## var
## <chr>
## 1 schoolID
## 2 name_full
## 3 city
## 4 state
## 5 country
##
## $SeriesPost
## # A tibble: 9 x 1
## var
## <chr>
## 1 yearID
## 2 round
## 3 teamIDwinner
## 4 lgIDwinner
## 5 teamIDloser
## 6 lgIDloser
## 7 wins
## 8 losses
## 9 ties
##
## $Teams
## # A tibble: 48 x 1
## var
## <chr>
## 1 yearID
## 2 lgID
## 3 teamID
## 4 franchID
## 5 divID
## 6 Rank
## 7 G
## 8 Ghome
## 9 W
## 10 L
## # ... with 38 more rows
##
## $TeamsFranchises
## # A tibble: 4 x 1
## var
## <chr>
## 1 franchID
## 2 franchName
## 3 active
## 4 NAassoc
##
## $TeamsHalf
## # A tibble: 10 x 1
## var
## <chr>
## 1 yearID
## 2 lgID
## 3 teamID
## 4 Half
## 5 divID
## 6 DivWin
## 7 Rank
## 8 G
## 9 W
## 10 L
# Find variables in common
reduce(lahmanNames, intersect)## # A tibble: 0 x 1
## # ... with 1 variable: var <chr>
No variables span all 26 datasets.
Common keys
No variables span all 26 datasets, but several variables span more than one dataset. These variables provide relationships between the various tables in Lahman. Can you find them?
lahmanNames %>%
# Bind the data frames in lahmanNames
bind_rows(.id = 'dataframe') %>%
# Group the result by var
group_by(var) %>%
# Tally the number of appearances
tally() %>%
# Filter the data
filter(n > 2) %>%
# Arrange the results
arrange(desc(n)) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| var | n |
|---|---|
| yearID | 21 |
| playerID | 19 |
| lgID | 17 |
| teamID | 13 |
| G | 10 |
| L | 6 |
| W | 6 |
| BB | 5 |
| CS | 5 |
| GS | 5 |
| H | 5 |
| HBP | 5 |
| HR | 5 |
| R | 5 |
| SB | 5 |
| SF | 5 |
| SO | 5 |
| awardID | 4 |
| GIDP | 4 |
| IBB | 4 |
| round | 4 |
| SH | 4 |
| stint | 4 |
| AB | 3 |
| CG | 3 |
| DP | 3 |
| E | 3 |
| ER | 3 |
| ERA | 3 |
| IPouts | 3 |
| label | 3 |
| SHO | 3 |
| SV | 3 |
| variable | 3 |
| WP | 3 |
| X2B | 3 |
| X3B | 3 |
yearID, playerID, lgID, and teamID are the most common variable names.
playerID
In the last exercise, you saw that playerID is one of the most widely used keys in the Lahman database. Which datasets use playerID?
lahmanNames %>%
# Bind the data frames
bind_rows(.id = 'dataframe') %>%
# Filter the results
filter(var == "playerID") %>%
# Extract the dataframe variable
`$`(dataframe)## [1] "AllstarFull" "Appearances" "AwardsManagers"
## [4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
## [7] "Batting" "BattingPost" "CollegePlaying"
## [10] "Fielding" "FieldingOF" "FieldingPost"
## [13] "HallOfFame" "Managers" "ManagersHalf"
## [16] "Master" "Pitching" "PitchingPost"
## [19] "Salaries"
The playerID variable appears in 19 of the 26 datasets in Lahman.
Salaries
Who are the players?
Before we look at salaries, let’s begin by ensuring that we have salary information for each player in the database, or at least no systematic holes in our coverage.
To do this we will need a list of every player to compare against salaries. The Master dataset contains all of the players in the database, but it may contain multiple rows for each player.
How should you proceed?
We saw how to use distinct() to find unique rows for all columns in a table. You can also find unique rows for specific columns using thefollowing syntax:
tbl %>%
# Find unique rows of columns a,b, and c
distinct(a, b, c)library(Lahman)players <- Master %>%
# Return one row for each distinct player
distinct(playerID, nameFirst, nameLast)
players %>%
head(10) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| playerID | nameFirst | nameLast |
|---|---|---|
| aardsda01 | David | Aardsma |
| aaronha01 | Hank | Aaron |
| aaronto01 | Tommie | Aaron |
| aasedo01 | Don | Aase |
| abadan01 | Andy | Abad |
| abadfe01 | Fernando | Abad |
| abadijo01 | John | Abadie |
| abbated01 | Ed | Abbaticchio |
| abbeybe01 | Bert | Abbey |
| abbeych01 | Charlie | Abbey |
players contains a concise list of player IDs and names. Plus we can feel certain that players contains only one row for each distinct player.
Missing salaries
Now that we know who the players are, let’s check our salary coverage. How many players are completely missing salary information?
Salaries %>%
head(10) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| yearID | teamID | lgID | playerID | salary |
|---|---|---|---|---|
| 1985 | ATL | NL | barkele01 | 870000 |
| 1985 | ATL | NL | bedrost01 | 550000 |
| 1985 | ATL | NL | benedbr01 | 545000 |
| 1985 | ATL | NL | campri01 | 633333 |
| 1985 | ATL | NL | ceronri01 | 625000 |
| 1985 | ATL | NL | chambch01 | 800000 |
| 1985 | ATL | NL | dedmoje01 | 150000 |
| 1985 | ATL | NL | forstte01 | 483333 |
| 1985 | ATL | NL | garbege01 | 772000 |
| 1985 | ATL | NL | harpete01 | 250000 |
players %>%
# Find all players who do not appear in Salaries
anti_join(Salaries, by = "playerID") %>%
# Count them
count()## # A tibble: 1 x 1
## n
## <int>
## 1 14468
We are missing the salaries for 14,468 players. No wonder baseball players strike so often!
Unpaid games?
Now that we know there is a huge hole in the salary data, let’s see if we can explain it. Is it possible that these players somehow did not play (and hence did not earn a salary)?
We can check with the Appearances data frame. Appearances contains information about every game played in major league baseball. That is, if a player played a game, it would show up as a row in Appearances.
Appearances %>%
head(10) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| yearID | teamID | lgID | playerID | G_all | GS | G_batting | G_defense | G_p | G_c | G_1b | G_2b | G_3b | G_ss | G_lf | G_cf | G_rf | G_of | G_dh | G_ph | G_pr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1871 | TRO | NA | abercda01 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1871 | RC1 | NA | addybo01 | 25 | 25 | 25 | 25 | 0 | 0 | 0 | 22 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1871 | CL1 | NA | allisar01 | 29 | 29 | 29 | 29 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 29 | 0 | 29 | 0 | 0 | 0 |
| 1871 | WS3 | NA | allisdo01 | 27 | 27 | 27 | 27 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1871 | RC1 | NA | ansonca01 | 25 | 25 | 25 | 25 | 0 | 5 | 1 | 2 | 20 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1871 | FW1 | NA | armstbo01 | 12 | 12 | 12 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 1 | 12 | 0 | 0 | 0 |
| 1871 | RC1 | NA | barkeal01 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1871 | BS1 | NA | barnero01 | 31 | 31 | 31 | 31 | 0 | 0 | 0 | 16 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1871 | FW1 | NA | barrebi01 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1871 | BS1 | NA | barrofr01 | 18 | 17 | 18 | 18 | 0 | 0 | 0 | 1 | 0 | 0 | 13 | 0 | 4 | 17 | 0 | 0 | 0 |
players %>%
anti_join(Salaries, by = "playerID") %>%
# How many unsalaried players appear in Appearances?
semi_join(Appearances, by = "playerID") %>%
count()## # A tibble: 1 x 1
## n
## <int>
## 1 14279
14,279 players played a game but are missing salary information. Interestingly, 191 players neither played a game nor have a recorded salary.
How many games?
Perhaps the unsalaried players only played one or two games, and hence did not earn a full salary. Can you determine how many games each of these unsalaried players played?
players %>%
# Find all players who do not appear in Salaries
anti_join(Salaries, by = "playerID") %>%
# Join them to Appearances
left_join(Appearances, by = "playerID") %>%
# Calculate total_games for each player
group_by(playerID) %>%
summarize(total_games=sum(G_all, is.na = FALSE)) %>%
# Arrange in descending order by total_games
arrange(desc(total_games))## # A tibble: 14,468 x 2
## playerID total_games
## <chr> <int>
## 1 yastrca01 3308
## 2 aaronha01 3298
## 3 cobbty01 3034
## 4 musiast01 3026
## 5 mayswi01 2992
## 6 robinbr01 2896
## 7 kalinal01 2834
## 8 collied01 2825
## 9 robinfr02 2808
## 10 wagneho01 2797
## # ... with 14,458 more rows
Many of these players appeared in thousands of games, which rules out our hypothesis that they did not appear in enough games to earn a salary.
How many at-bats?
Is it possible that the unsalaried players did not actually play in the games that they appeared in? One way to check would be to determine if the players had an at-bat (i.e. batted) in the games that they appeared in.
players %>%
# Find unsalaried players
anti_join(Salaries, by = "playerID") %>%
# Join Batting to the unsalaried players
left_join(Batting, by = "playerID") %>%
# Group by player
group_by(playerID) %>%
# Sum at-bats for each player
summarize(total_at_bat = sum(AB, is.na = FALSE)) %>%
# Arrange in descending order
arrange(desc(total_at_bat))## # A tibble: 14,468 x 2
## playerID total_at_bat
## <chr> <int>
## 1 aaronha01 12364
## 2 yastrca01 11988
## 3 cobbty01 11435
## 4 musiast01 10972
## 5 mayswi01 10881
## 6 robinbr01 10654
## 7 wagneho01 10439
## 8 brocklo01 10332
## 9 ansonca01 10281
## 10 aparilu01 10230
## # ... with 14,458 more rows
The unpaid players definitely participated in the games. In fact, you can spot Ty Cobb and Hank Aaron in the top three at-bats. I think it’s safe to assume that you are dealing with missing data here and not unsalaried players.
Introducing the hall of fame
Hall of fame nominations
The Hall of Fame is a collection of distinguished baseball players selected by a committee of baseball experts. As with any hall of fame, more players are nominated for membership than are actually admitted.
- Let’s see how many players have been nominated for the Hall of Fame.
# Find the distinct players that appear in HallOfFame
nominated <- HallOfFame %>%
distinct(playerID)
nominated %>%
# Count the number of players in nominated
count()## # A tibble: 1 x 1
## n
## <int>
## 1 1279
nominated_full <- nominated %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)There were 1,279 nominees for the Hall of Fame. We now have a dataset of everyone nominated! Let’s create a dataset for everyone inducted to the Hall of Fame in the next exercise.
Hall of fame inductions
In the previous exercise, we saw that 1,279 players were nominated for the hall of fame. Let’s now see how many players were admitted to the hall of fame to examine how selective the voting process is.
# Find distinct players in HallOfFame with inducted == "Y"
inducted <- HallOfFame %>%
filter(inducted == "Y") %>%
distinct(playerID)
inducted %>%
# Count the number of players in inducted
count()## # A tibble: 1 x 1
## n
## <int>
## 1 323
inducted_full <- inducted %>%
# Join to Master
left_join(Master, by = "playerID") %>%
# Return playerID, nameFirst, nameLast
select(playerID, nameFirst, nameLast)323 players have been inducted into the Hall of Fame out of 1,279 nominees. We now also have datasets of everyone inducted and everyone nominated.
Awards
Now that we know who was inducted and who was nominated, let’s examine what separates the nominees who were inducted from the nominees who were not.
Let’s start with a simple question: Did nominees who were inducted earn more awards than nominees who were not inducted?
We can use AwardsPlayers to answer the question. It lists the playerID’s of players who won baseball awards, and it contains one row for each award awarded in major league baseball.
AwardsPlayers %>%
head(10) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| playerID | awardID | yearID | lgID | tie | notes |
|---|---|---|---|---|---|
| bondto01 | Pitching Triple Crown | 1877 | NL | NA | NA |
| hinespa01 | Triple Crown | 1878 | NL | NA | NA |
| heckegu01 | Pitching Triple Crown | 1884 | AA | NA | NA |
| radboch01 | Pitching Triple Crown | 1884 | NL | NA | NA |
| oneilti01 | Triple Crown | 1887 | AA | NA | NA |
| keefeti01 | Pitching Triple Crown | 1888 | NL | NA | NA |
| clarkjo01 | Pitching Triple Crown | 1889 | NL | NA | NA |
| rusieam01 | Pitching Triple Crown | 1894 | NL | NA | NA |
| duffyhu01 | Triple Crown | 1894 | NL | NA | NA |
| youngcy01 | Pitching Triple Crown | 1901 | AL | NA | NA |
# Tally the number of awards in AwardsPlayers by playerID
nAwards <- AwardsPlayers %>%
group_by(playerID) %>%
tally()
nAwards %>%
# Filter to just the players in inducted
inner_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))## # A tibble: 1 x 1
## avg_n
## <dbl>
## 1 12.0
nAwards %>%
# Filter to just the players in nominated
inner_join(nominated, by = "playerID") %>%
# Filter to players NOT in inducted
anti_join(inducted, by = "playerID") %>%
# Calculate the mean number of awards per player
summarize(avg_n = mean(n, na.rm = TRUE))## # A tibble: 1 x 1
## avg_n
## <dbl>
## 1 4.21
On Average, inductees had 12.04 - 4.21 = 7.83 more awards than non-inductees.
Salary
Salary may provide another way to differentiate inductees from non-inductees. Does the maximum salary earned by inductees tend to be greater than the maximum salary earned by nominees who were not inducted?
# Find the players who are in nominated, but not inducted
notInducted <- nominated %>%
setdiff(inducted)
Salaries %>%
# Find the players who are in notInducted
semi_join(notInducted, by = "playerID") %>%
# Calculate the max salary by player
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
# Calculate the average of the max salaries
summarize(avg_salary = mean(max_salary, na.rm = TRUE))## # A tibble: 1 x 1
## avg_salary
## <dbl>
## 1 5453789.
# Repeat for players who were inducted
Salaries %>%
semi_join(inducted, by = "playerID") %>%
group_by(playerID) %>%
summarize(max_salary = max(salary, na.rm = TRUE)) %>%
summarize(avg_salary = mean(max_salary, na.rm = TRUE))## # A tibble: 1 x 1
## avg_salary
## <dbl>
## 1 6666063.
It turns out that the average salary of players who were inducted was $5,079,720 - $4,677,737 = $401,983 more per year. We know that we have some missing salary information, but what else could be wrong with this analysis?
Retirement
One of the rules of the Hall of Fame is that players cannot be nominated until five years after they retire. Is this reflected in our data?
Appearances %>%
# Filter Appearances against nominated
semi_join(nominated, by = "playerID") %>%
# Find last year played by player
group_by(playerID) %>%
summarize(last_year = max(yearID)) %>%
# Join to full HallOfFame
left_join(HallOfFame, by = "playerID") %>%
# Filter for unusual observations
filter(yearID <= last_year)## # A tibble: 39 x 10
## playerID last_year yearID votedBy ballots needed votes inducted category
## <chr> <dbl> <int> <chr> <int> <int> <int> <fct> <fct>
## 1 cissebi~ 1938 1937 BBWAA 201 151 1 N Player
## 2 cochrmi~ 1937 1936 BBWAA 226 170 80 N Player
## 3 deandi01 1947 1945 BBWAA 247 186 17 N Player
## 4 deandi01 1947 1946 Final ~ 263 198 45 N Player
## 5 deandi01 1947 1946 Nomina~ 202 NA 40 N Player
## 6 deandi01 1947 1947 BBWAA 161 121 88 N Player
## 7 dickebi~ 1946 1945 BBWAA 247 186 17 N Player
## 8 dickebi~ 1946 1946 Nomina~ 202 NA 40 N Player
## 9 dickebi~ 1946 1946 Final ~ 263 198 32 N Player
## 10 dimagjo~ 1951 1945 BBWAA 247 186 1 N Player
## # ... with 29 more rows, and 1 more variable: needed_note <chr>
It looks like quite a few players have been nominated before they retired, but this practice seems much less frequent in recent years.
Session info
sessionInfo()## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Switzerland.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] purrr_0.3.0 ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0
## [5] kableExtra_1.0.1 knitr_1.21
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 highr_0.7 plyr_1.8.4
## [4] pillar_1.3.1 compiler_3.5.2 prettydoc_0.2.1
## [7] tools_3.5.2 digest_0.6.18 gtable_0.2.0
## [10] evaluate_0.12 tibble_2.0.1 viridisLite_0.3.0
## [13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1
## [16] rstudioapi_0.9.0 yaml_2.2.0 xfun_0.4
## [19] withr_2.1.2 httr_1.4.0 stringr_1.4.0
## [22] xml2_1.2.0 hms_0.4.2 webshot_0.5.1
## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0
## [28] R6_2.4.0 fansi_0.4.0 rmarkdown_1.11
## [31] readr_1.3.1 magrittr_1.5 codetools_0.2-15
## [34] scales_1.0.0 htmltools_0.3.6 assertthat_0.2.0
## [37] rvest_0.3.2 colorspace_1.4-0 utf8_1.1.4
## [40] stringi_1.3.1 lazyeval_0.2.1 munsell_0.5.0
## [43] crayon_1.3.4